2  Numerical Data

We have learnt how to visualise data using tables, which are useful for both analysts and readers when examining a dataset.

However, tables are not always the most effective way to quickly and intuitively understand patterns or trends in the data. This is where figures become valuable.

In this chapter, we will focus on different types of figures that can be used to represent quantitative variables. For plot interpretation, refer to the Inferential Statistics With R short course.

3 Rice

Photo by Zhao Yangjun on Unsplash

In this chapter, we are going to explore the rice dataset.

Description of Data Set Variables
No Variable Description
1 Ref Reference number
2 ExpSite Name of the site/farmer
3 Year Rice sown in October and harvested the following year
4 ExpName Combination of experiment site and year
5 SowingMethod Four sowing methods – Drill, Aerial, Delayed Permanent Water (DPW) and DryBC
6 Range Replicate identification number of the experiment (each treatment has 3 replicates)
7 Row Row identification number (1–24)
8 Rep Same as Range
9 Variety Variety of rice
10 Nitrogen Nitrogen rates (kg N/ha) applied prior to Permanent Water (PW) and at Panicle Initiation (PI)
11 NGroup Groups representing different nitrogen rates
12 Estab Plant population at establishment (No./m²)
13 PI_DM Weight of dry matter at PI (g/m²)
14 PI_Tillers Count of tillers at PI (No./m²)
15 PI_NIR Percentage of nitrogen in tissue at PI (% N)
16 PI_N_Uptake Nitrogen uptake at PI (kg N/ha)
17 PM_DM Weight of dry matter at physiological maturity (g/m²)
18 HarvTillers Count of tillers at PM (No./m²)
19 HI Harvest index (ratio of grain to total plant dry weight)
20 QuadYield Quad (hand-harvested) grain yield (t/ha at 14% moisture)
21 HeaderYield Header (machine-harvested) grain yield (t/ha at 14% moisture)
22 Lodging Lodging score at harvest (1 = standing, 10 = fully lodged)
23 PlantHeight Average plant height at harvest (cm)

Load the rice_data.CSV dataset yourself and store it as a dataframe called rice. Then answer the questions below.

rice <- read.csv("Data-sets/rice_data.csv") str(rice)
rice <- read.csv("Data-sets/rice_data.csv")
str(rice)
  1. How many variables are in the rice dataset?
  2. How many observations are in the rice dataset?
  3. How many categorical variables are there? (Hint: they may not be stored correctly…yet)
  4. How many numerical variables are there? (Hint: they may not be stored correctly…yet)
  5. What type of variable has R stored them?

You may notice that some data cleaning is required.

In the middle of the Excel file (around row 1465), there is a blank row, followed by a repeated header row.

rice[1465:1466,]
     Ref Experiment.site Year Exp.name Sowing.method RANGE ROW REP Variety
1465  NA                                                                  
1466  NA Experiment site Year          Sowing method RANGE ROW REP Variety
     Nitrogen N.group Estab..No.msq. PI.DM..g.msq. PI.Tillers..No.msq.
1465                                                                  
1466 Nitrogen         Estab (No/msq) PI DM (g/msq) PI Tillers (No/msq)
     PI.NIR..N PI.N.uptake..kg.N.ha. PM.DM..g.msq. Harv.tillers..No.msq. HI
1465                                                                       
1466 PI NIR %N PI N uptake (kg N/ha) PM DM (g/msq) Harv tillers (No/msq) HI
     Quad.Yield..t.ha..14.. Header.Yield..t.ha..14.. Lodging..10.flat.
1465                                                                  
1466 Quad Yield (t/ha@ 14%) Header Yield (t/ha@ 14%) Lodging (10=flat)
     Plant.height..cm.
1465                  
1466 Plant height (cm)

In addition, there are several missing values in the dataset, which are denoted by “*”.

sum(rice == "*", na.rm = TRUE)
[1] 1527

There are a few different ways we can approach missing values. The first step is to identify the extent of missing data and then diagnose the issue. Is the data missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)? In other words, is the issue structural, or is it genuinely random?

Depending on the type of missing data, we might consider:

  • deleting rows with missing values

  • imputing missing values using the mean

  • imputing missing values using MICE (Multivariate Imputation by Chained Equations)

In these short courses, imputation is beyond the scope of the material covered, so we will focus on deleting missing rows (with appropriate justification). To learn how to properly treat missing data, I recommend the Missing Data Imputation in R tutorial from Princeton.

Before exploring the missing data, let’s do a quick clean. There are some trailing whitespace issues in a few observations, and we also want to focus on the * entries and convert these to NA. We then filter out any blank rows and remove the repeated header row.

library (dplyr) clean_rice <- rice %>% # Convert "*" to NA mutate(across(where(is.character), ~ na_if(.x, "*"))) %>% # Drop fully blank rows filter(if_any(everything(), ~ !is.na(.x) & .x != "")) %>% # Drop repeated header row filter(Sowing.method != "Sowing method")
library (dplyr)

clean_rice <- rice %>%
  # Convert "*" to NA
  mutate(across(where(is.character),
                ~ na_if(.x, "*"))) %>%
  # Drop fully blank rows
  filter(if_any(everything(), ~ !is.na(.x) & .x != "")) %>%  
  # Drop repeated header row
  filter(Sowing.method != "Sowing method")  

Let’s visualise the missing values and count how many cells are missing.

sapply(clean_rice, function(x) sum(is.na(x)))
                     Ref          Experiment.site                     Year 
                     312                        0                        0 
                Exp.name            Sowing.method                    RANGE 
                       0                        0                        0 
                     ROW                      REP                  Variety 
                       0                        0                        0 
                Nitrogen                  N.group           Estab..No.msq. 
                       0                        0                       73 
           PI.DM..g.msq.      PI.Tillers..No.msq.                PI.NIR..N 
                      77                       77                       77 
   PI.N.uptake..kg.N.ha.            PM.DM..g.msq.    Harv.tillers..No.msq. 
                      78                       25                       27 
                      HI   Quad.Yield..t.ha..14.. Header.Yield..t.ha..14.. 
                      25                       25                      904 
       Lodging..10.flat.        Plant.height..cm. 
                      86                       53 
sum(is.na(clean_rice))
[1] 1839
library(visdat)
vis_miss(clean_rice)

All treatment variables are complete, while missing values mostly occur in the response and intermediate agronomic measurements. The main example is HeaderYield, which is missing for many observations because machine-harvest data were only collected at some sites. In other words, this is structural missing data i.e. the values do not exist for certain site–season combinations rather than being randomly missing. Because of this, it makes sense to drop HeaderYield NAs only when we are visualising or analysing HeaderYield.

We also see missing values in the reference number (Ref). This appears to be caused by a data import issue, where a blank row and a repeated header row were introduced in the middle of the dataset. After this point, the reference numbers are no longer populated. Since Ref is just an identifier, filling in these values would be inappropriate, so we do not attempt to impute them. Instead, the problematic rows can be removed where needed.

Finally, for visualisation, we do not want to drop all rows that contain missing values, as this would remove valid data unnecessarily. Instead, for each plot, we only remove the missing values that are relevant to the variables being visualised.

We now continue cleaning the data accordingly.

library(tidyverse) library(stringr) clean_rice <- clean_rice %>% mutate(across(where(is.character), str_squish)) %>% mutate(across( c( Ref, Experiment.site, Year, Exp.name, Sowing.method, RANGE, ROW, REP, Variety, Nitrogen, N.group, Lodging..10.flat. ), as.factor )) %>% mutate(across(where(is.character), as.numeric))
library(tidyverse)
library(stringr)

clean_rice <- clean_rice %>%
  mutate(across(where(is.character), str_squish)) %>%
  mutate(across(
    c(
      Ref, Experiment.site, Year, Exp.name, Sowing.method,
      RANGE, ROW, REP, Variety, Nitrogen, N.group, Lodging..10.flat.
    ),
    as.factor
  )) %>%
  mutate(across(where(is.character), as.numeric)) 

Let’s check the structure again to make sure everything has been cleaned up correctly.

str(clean_rice)
str(clean_rice)

Yay! Now that the dataset is cleaned, we’re ready to start plotting.

One Quantitative Variable

When you have a single numerical variable, whether discrete or continuous, a histogram is a great way to visualise its distribution.

In R, you can create plots using base R. For example, to see a visualisation of plant height, you can simply type:

hist(clean_rice$Plant.height..cm.)

to generate a histogram.

While base R plots are fine for basic analysis,ggplot2 is one of the most commonly used plotting packages and is widely accepted in published reports.

The histogram looks a bit plain without any color, and the x-axis label isn’t very informative. Let’s add some color, fix the labels, and give it a clean white background.

Load in the package and then run the code below.

attach(clean_rice)
library(ggplot2) ggplot(data=clean_rice, aes(x=Plant.height..cm.))+ geom_histogram(fill="palegreen3", col="darkgreen", bins=30)+ labs(x="Rice Plant Height (cm)", y="Frequency")+ theme_bw()
library(ggplot2)

ggplot(data=clean_rice, aes(x=Plant.height..cm.))+
  geom_histogram(fill="palegreen3", col="darkgreen", bins=30)+
  labs(x="Rice Plant Height (cm)", y="Frequency")+
  theme_bw()

We can see that this is much better!

If you see an NA warning when creating this histogram in RStudio, don’t worry. R automatically excludes these values from the plot. To avoid the warning, we can explicitly drop the NAs for the variable in the pipeline, as shown below:

clean_rice %>%
  drop_na(Plant.height..cm.) %>%
  ggplot(aes(x = Plant.height..cm.)) +
  geom_histogram(fill = "palegreen3", col = "darkgreen", bins = 30) +
  labs(x = "Rice Plant Height (cm)", y = "Frequency") +
  theme_bw()

It is very important that when creating a plot, you provide meaningful labels with units. You can also see below how different bin choices influence the histogram.

ggplot(data=clean_rice, aes(x=Plant.height..cm.))+
  geom_histogram(fill="palegreen3", col="darkgreen", bins=10)+
  labs(x="Rice Plant Height (cm)", y="Frequency")+
  theme_bw()

ggplot(data=clean_rice, aes(x=Plant.height..cm.))+
  geom_histogram(fill="palegreen3", col="darkgreen", bins=100)+
  labs(x="Rice Plant Height (cm)", y="Frequency")+
  theme_bw()

We can make the histogram even more informative by filling the bars by sowing method. Do plant heights differ between different sowing methods?

ggplot(data=clean_rice, aes(x=Plant.height..cm., fill=Sowing.method))+
  geom_histogram(binwidth=2)+
  labs(x="Rice Plant Height (cm)", y="Frequency")+ 
  theme_bw()

ggplot(data=clean_rice, aes(x=Plant.height..cm., col=Sowing.method))+
  geom_freqpoly(binwidth=2)+
  labs(x="Rice Plant Height (cm)", y="Frequency")+ 
  theme_bw()

ggplot(data=clean_rice, aes(x=Plant.height..cm., after_stat(density), col=Sowing.method))+
  geom_freqpoly(binwidth=2)+
  labs(x="Rice Plant Height (cm)", y="Frequency")+ 
  theme_bw()

Two Quantitative Variables

When working with two quantitative variables, scatterplot is a suitable choice.

Fill in the blanks below to create a plot of dry matter at PI against plant height.

ggplot(data=clean_rice, aes(x=PI.DM..g.msq., y=Plant.height..cm.))+ geom_point(fill="darkolivegreen3", col="darkolivegreen", pch=21, size=3)+ labs(x=expression("Weight of Dry Matter at PI ("*g/m^2*")"),y="Rice Plant Height (cm)")+ geom_smooth(method="lm", col="black")+ theme_bw()
ggplot(data=clean_rice, aes(x=PI.DM..g.msq., y=Plant.height..cm.))+
  geom_point(fill="darkolivegreen3", col="darkolivegreen", pch=21, size=3)+
  labs(x=expression("Weight of Dry Matter at PI ("*g/m^2*")"),y="Rice Plant Height (cm)")+
  geom_smooth(method="lm", col="black")+
  theme_bw()

We can see that as the weight of the dry matter increases, the plant height also slightly increases. This suggests a weak to moderate positive linear relationship. Looking for more details on understanding linear relationships? Check out the Inferential Statistics with R and Linear Regression short courses!

To visualise the linear relationship, we include method = "lm" in geom_smooth(), which fits a linear model (a straight line) to the data.

ggplot(data=clean_rice, aes(x=PI.DM..g.msq., y=Plant.height..cm.))+
  geom_point(fill="darkolivegreen3", col="darkolivegreen", pch=21, size=3)+
  labs(x=expression("Weight of Dry Matter at PI ("*g/m^2*")"),y="Rice Plant Height (cm)")+
  geom_smooth(method="lm", col="black")+
  theme_bw()

We can customise the scatterplot even further by using the mapping argument to colour the points by sowing method.

clean_rice %>%
  drop_na(Sowing.method) %>%
  ggplot(aes(
    x = PI.DM..g.msq.,
    y = Plant.height..cm.,
    colour = Sowing.method,
    fill   = Sowing.method,
  )) +
  geom_point(size = 3) +
  labs(
    x = expression("Weight of Dry Matter at PI ("*g~m^-2*")"),
    y = "Rice Plant Height (cm)"
  ) +
  theme_bw()

Quantitative and Qualitative Variables

You may also want to create plots that compare quantitative and qualitative variables. For example, does plant height differ between rice varieties? How does header yield vary across sowing methods? In these cases, a boxplot is a useful way to visualise the relationship.

Fill in the blanks below to create a boxplot showing plant height (Plant.height..cm.) across groups representing different nitrogen rates (N.group).

clean_rice$N.group <- factor( clean_rice$N.group, levels = c("zero", "PW_1", "PW_2", "PW_3", "split", "split 2", "split 3") ) ggplot(clean_rice, aes(x=N.group, y=Plant.height..cm.))+ geom_boxplot(fill="#EF8354", col="#DF3B57")+ labs(x="Nitrogen Rate Group",y="Rice Plant Height (cm)")+ theme_bw()
clean_rice$N.group <- factor(
  clean_rice$N.group,
  levels = c("zero", "PW_1", "PW_2", "PW_3", "split", "split 2", "split 3")
)

ggplot(clean_rice, aes(x=N.group, y=Plant.height..cm.))+
  geom_boxplot(fill="#EF8354", col="#DF3B57")+
  labs(x="Nitrogen Rate Group",y="Rice Plant Height (cm)")+
  theme_bw()

You can also customise even further.

ggplot(
  clean_rice,
  aes(x = N.group, y = Plant.height..cm., fill = N.group, colour = N.group)
) +
  geom_boxplot(col="black") +
  facet_wrap(~ Sowing.method) +
  scale_fill_manual(values =  c(
  "#D8A47F",
  "#EF8354",
  "#EE4B6A",
  "#C57A44",
  "#EA5C1F",
  "#EA1F44",
  "#0F7173")) +
  labs(
    x = "Nitrogen rate group",
    y = "Rice plant height (cm)",
    title = "Rice Plant Heights vs. Nitrogen Rate by Sowing Method"
  ) +
  theme_bw() +
  theme(legend.position = "none")

Try creating a boxplot of Header Yield (machine-harvested grain yield, tonnes per hectare at 14% moisture) versus sowing method on your own, and customise it to your liking.

clean_rice %>%
  drop_na(Header.Yield..t.ha..14..) %>%
ggplot(aes(x = Sowing.method, y = Header.Yield..t.ha..14.., fill = Sowing.method, 
                 colour = Sowing.method)) +
  geom_boxplot() +
  scale_fill_manual(values = c("#A05475", "#EE586C", "#6B5C8A")) +
  scale_colour_manual(values = c("#3B1F2B", "#D8162F", "#3F355F")) +
  labs(x = "Sowing Method", y = "Header Yield (Tonnes per Hectare*)",
       title="Sowing Method vs. Header Yield by Nitrogen Group",
       caption = "*At 14% Moisture") +
  theme_classic() +
  facet_wrap(~N.group)

detach(clean_rice)

4 Palmer Penguins

Now that we’ve learned a bit about plotting, let’s test your new skills using the penguins dataset we worked with in the previous chapter. Try completing the exercises below.

library(ggplot2) library(palmerpenguins) library(visdat) vis_miss(penguins)
library(ggplot2)

library(palmerpenguins)

library(visdat)

vis_miss(penguins)

What do you notice about the missing data, and how might it be handled appropriately?

There is minimal missing data overall (approximately 0.7%). Although the variable sex has the highest proportion of missing values, the amount is small and does not show an obvious pattern with other variables in the dataset. As a result, removing these observations is unlikely to have a meaningful impact on the overall dataset.

clean_penguins <- na.omit(penguins) ggplot(clean_penguins, aes(x=body_mass_g))+ geom_histogram(fill="orangered", bins=30) + labs(x="Body Mass (g)")+ theme_bw()
clean_penguins <- na.omit(penguins)

ggplot(clean_penguins, aes(x=body_mass_g))+
  geom_histogram(fill="orangered", bins=30) +
  labs(x="Body Mass (g)")+
  theme_bw()
#1. ggplot(data=clean_penguins, aes(x=body_mass_g, col=species))+ geom_freqpoly(binwidth=100)+ labs(x="Body Mass (g)", y="Frequency")+ theme_bw() #2. ggplot(data=clean_penguins, aes(x=body_mass_g, after_stat(density), col = species)) + geom_freqpoly(binwidth = 100)+ labs(x="Body Mass (g)", y="Frequency")+ theme_bw()

#1. 
ggplot(data=clean_penguins, 
       aes(x=body_mass_g, col=species))+ 
  geom_freqpoly(binwidth=100)+
  labs(x="Body Mass (g)", y="Frequency")+ 
  theme_bw()

#2. 
ggplot(data=clean_penguins, 
       aes(x=body_mass_g, 
           after_stat(density), 
           col = species)) + 
  geom_freqpoly(binwidth = 100)+
  labs(x="Body Mass (g)", y="Frequency")+ 
  theme_bw()

Scatterplots

What if we’re interested in how body mass relates to flipper length? In this case, we’re working with two continuous numerical variables, so a scatterplot would be a good choice.

ggplot(clean_penguins, aes(x=flipper_length_mm, y=body_mass_g))+ geom_point(fill="aquamarine", col="aquamarine4", pch=21, size=2.5)+ labs(x="Flipper Length (mm)", y="Body Mass (g)")+ theme_bw()

ggplot(clean_penguins, aes(x=flipper_length_mm, y=body_mass_g))+
  geom_point(fill="aquamarine", col="aquamarine4", pch=21, size=2.5)+
  labs(x="Flipper Length (mm)", y="Body Mass (g)")+
  theme_bw()
ggplot(clean_penguins, mapping = aes(x=flipper_length_mm, y=body_mass_g)) + geom_point(mapping = aes(color = species)) + geom_smooth(method = "lm") + labs(x="Flipper Length (mm)", y="Body Mass (g)")+ theme_bw()
ggplot(clean_penguins,
  mapping = aes(x=flipper_length_mm, y=body_mass_g)) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm") +
  labs(x="Flipper Length (mm)", y="Body Mass (g)")+
  theme_bw()
ggplot( data = clean_penguins, mapping = aes(x=flipper_length_mm, y=body_mass_g) ) + geom_point(mapping = aes(color = species, shape = species)) + geom_smooth(method = "lm") + scale_color_manual( values = c( Adelie = "#03045e", Chinstrap = "#0096c7", Gentoo = "#90e0ef"))+ labs(x="Flipper Length (mm)", y="Body Mass (g)")+ theme_bw()
ggplot(
  data = clean_penguins,
  mapping = aes(x=flipper_length_mm, y=body_mass_g)
) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  scale_color_manual(
    values = c(
      Adelie = "#03045e",
      Chinstrap = "#0096c7",
      Gentoo = "#90e0ef"))+
  labs(x="Flipper Length (mm)", y="Body Mass (g)")+
  theme_bw()

Boxplots

ggplot(clean_penguins, aes(x=species, y=body_mass_g))+ geom_boxplot(fill = c("coral2", "chartreuse2", "cadetblue2"), col = c("coral3", "chartreuse4", "cadetblue4")) + labs(x="Species of Penguin", y="Body Mass (g)")
ggplot(clean_penguins, aes(x=species, y=body_mass_g))+
  geom_boxplot(fill = c("coral2", "chartreuse2", "cadetblue2"),
               col = c("coral3", "chartreuse4", "cadetblue4")) +
  labs(x="Species of Penguin", y="Body Mass (g)")

Provide an interpretation of the above plot.

We can see that Gentoo penguins have a noticeably higher median body mass compared to Chinstrap and Adelie penguins. The interquartile range (the box) for the Gentoo species does not overlap with the others, suggesting there may be a significant difference between them.

Fantastic work!

5 Additional Notes

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = island)) +
  scale_colour_brewer(palette = "Set1")

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  facet_wrap(~island) +
  scale_colour_viridis_d()

Colour blind

library(ggthemes)
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  facet_wrap(~island) +
  scale_colour_colorblind()

library("scales")

show_col(colorblind_pal()(8))